Adapting Standard Open-Source Resources To Tagging A Morphologically Rich Language: A Case Study With Arabic
Author
Abstract
In this paper we investigate the possibility of creating a PoS tagger for Modern Standard Arabic by integrating open-source tools, in particular a morphological analyser used in the disambiguation process together with a PoS tagger trained on Classical Arabic. The investigation reveals a scarcity of open-source tools and resources, which complicated the integration process. Among the problems are the differing input/output formats of the tools, the differing granularity of their tag sets, and their different tokenisation schemes. The final prototype of the PoS tagger was trained on Classical Arabic and tested on a sample text of Modern Standard Arabic. The results are modest: the prototype achieves an accuracy of only 73%. The paper nevertheless outlines the difficulties of integrating the tools available today, proposes ideas for future work in the field, and shows that Classical Arabic is not sufficient as training data for an Arabic tagger.
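As a rough illustration of the kind of integration the abstract describes, the following minimal sketch assumes a morphological analyser that returns a set of fine-grained candidate tags for a token and a tagger that returns a score per coarse tag. The function names, the tag mapping, and the scores are all hypothetical; they are not taken from the tools used in the paper.

```python
# Hypothetical mapping from an analyser's fine-grained tags to a tagger's
# coarser tag set. Real tag sets differ; this only illustrates the idea of
# bridging two granularities.
FINE_TO_COARSE = {
    "NOUN_PROP": "N",
    "NOUN": "N",
    "VERB_PERF": "V",
    "VERB_IMPF": "V",
    "PREP": "P",
}


def coarse_candidates(analyser_tags):
    """Map the analyser's fine-grained tags onto the tagger's coarse tag set."""
    return {FINE_TO_COARSE[t] for t in analyser_tags if t in FINE_TO_COARSE}


def disambiguate(tagger_scores, analyser_tags):
    """Pick the tagger's best-scoring tag among those the analyser allows.

    Falls back to the tagger's overall best tag when the analyser offers
    no usable candidate (for example, for an out-of-vocabulary word).
    """
    allowed = coarse_candidates(analyser_tags)
    pool = {t: s for t, s in tagger_scores.items() if t in allowed}
    if not pool:
        pool = tagger_scores
    return max(pool, key=pool.get)


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

Under these assumptions, a token the tagger would label "N" is relabelled "V" when the analyser only licenses verbal readings, and tokens the analyser cannot analyse keep the tagger's own choice. An accuracy figure such as the one reported in the abstract would then be computed per token against a gold-standard annotation.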
Similar resources
Arabic Morphosyntactic Raw Text Part of Speech Tagging System
Introduction and Overview: The topic of this dissertation is morphosyntactic part of speech tagging (abbreviated POS tagging) for Arabic. This topic has a long and rich history for other languages, mainly for English. POS tagging provides fundamental information about word forms used in sentences of natural language. The method of utilizing this information varies depending on the particular NLP ...
Considering a resource-light approach to learning verb valencies
Here we describe work on learning the subcategories of verbs in a morphologically rich language using only minimal linguistic resources. Our goal is to learn verb subcategorizations for Quechua, an under-resourced morphologically rich language, from an unannotated corpus. We compare results from applying this approach to an unannotated Arabic corpus with those achieved by processing the same te...
Transforming Standard Arabic to Colloquial Arabic
We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The process holds promise for any NLP task targeting the dialectal varieties of Arabi...
Joint Segmentation and POS Tagging for Arabic Using a CRF-based Classifier
Arabic is a morphologically rich language, and Arabic texts abound in complex word forms built by concatenation of multiple subparts, corresponding for instance to prepositions, articles, roots, prefixes, or suffixes. The development of Arabic Natural Language Processing applications, such as Machine Translation (MT) tools, thus requires some kind of morphological analysis. In this paper, we com...
Genetic approach for Arabic part of speech tagging
With the growing number of textual resources available, the ability to understand them becomes critical. An essential first step in understanding these sources is the ability to identify the parts-of-speech in each sentence. Arabic is a morphologically rich language, which presents a challenge for part of speech tagging. In this paper, our goal is to propose, improve, and implement a part-of-sp...